Modeling
Lecture 7
Midterm Statistics
Nice work!
- Regrade requests due by next Friday
- Take a look at the answer key first
How is this class going for you so far?
How was the midterm?
Anonymous mid-semester evaluation
https://forms.gle/5b9StxauCupWyL8WA
Announcements
- Previously, lecture examples were posted as plain R scripts
(they look like what you would code)
- This was to help you understand what a script should
look like
- Lecture code on ELMS will be in R Markdown form from this
lecture onwards
- Easier to see script and output from a script
- Helpful for reviewing material without having to run
everything
R Markdown
Open the __.Rmd file
(Or click on the .Rmd file and open with RStudio)
R Markdown: run code (Ctrl + Alt + R)
R Markdown: Preview
R Markdown
White background with text: English explanations
Grey background with text: R code (like your script)
White background immediately below grey: R code output (like your console)
Modeling
- Learn about relationships between variables
- Use value of one variable to predict value in another
- Does NOT tell us anything about causation
- Examples:
- Beer’s law (absorbance vs concentration)
- Disease vs exposure (lung cancer vs smoking)
- BMI vs cholesterol
Linear Regression
y = mx + b
Modeling: Linear Regression
- Beer's law: absorbance = constant * concentration
- It basically just says that absorbance varies linearly with
concentration
- Y = slope * X
- Our dataset: Formaldehyde
(built into R)
- View(Formaldehyde)
- carb = concentration
- optden = absorbance
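A quick way to get familiar with the dataset before plotting is to inspect it in the console. This is a minimal sketch that uses only base R functions; the choice of str(), head(), and summary() is ours and is not from the slides.

```r
# Formaldehyde ships with base R, so there is nothing to import
str(Formaldehyde)      # column names and types
head(Formaldehyde)     # first rows of the data frame
summary(Formaldehyde)  # min, max, and quartiles of carb and optden
```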
Before creating a line of best fit, we should first
visually check to see if a graph of optden vs carb
looks linear
Image: https://www.dummies.com/education/math/business-statistics/use-scatter-plots-to-identify-a-linear-relationship-in-simple-regression-analysis/
We would NOT want to fit a
straight line to this graph
Graphing: plot()
plot(x = ____, y = _____)
- Scatterplot of y vs. x
- x and y are vectors
Example:
- From Beer's law, we expect a graph of optden vs carb to
form a line
- plot(x = Formaldehyde$carb, y = Formaldehyde$optden)
Graphing: plot()
Our plot:
Graphing: plot() title, axis labels
plot(x = ____, y = _____, main = "__", xlab = "__", ylab = "__")
- Scatterplot of y vs. x
- main: title
- xlab: x axis label; ylab: y axis label
Example:
- plot(x = Formaldehyde$carb, y = Formaldehyde$optden,
main = "Beer's Law", xlab = "concentration", ylab =
"absorbance")
Our plot:
We'll learn how to make this more aesthetically appealing in another lecture
Your Turn: Formaldehyde
- Formaldehyde$carb is currently in mL (ignore the fact that mL
is a unit for volume)
- Convert to L (1 mL = 0.001 L, so multiply by 0.001) and plot
Formaldehyde$optden vs the new concentration vector
- Label the x axis with "Concentration (L)"
Hint:
plot(x = __, y = ___, main = "__", xlab = "__", ylab = "__")
Formaldehyde Absorbance vs Liters
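One possible way to approach this exercise is sketched below. The vector name carb_liters is our own choice, and the title and y axis label are placeholders.

```r
# Convert concentration from mL to L: 1 mL = 0.001 L
carb_liters <- Formaldehyde$carb * 0.001

# Scatterplot of absorbance vs the converted concentration
plot(x = carb_liters, y = Formaldehyde$optden,
     main = "Formaldehyde Absorbance vs Liters",
     xlab = "Concentration (L)", ylab = "absorbance")
```

Multiplying a vector by a single number converts every element at once, so no loop is needed.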
Linear Regression
- Now that we see the data is linear, let’s fit a line!
variable_name <- lm(y ~ x)
- Fits a line for y vs x and stores into variable_name
Example:
linear_model <- lm(Formaldehyde$optden ~ Formaldehyde$carb)
Linear Regression
- To see what the actual linear model is, we can print it
print(variable_name)
- What would give us even more information is calling summary()
summary(variable_name)
Linear Regression
Interpretation of these R results:
y = mx + b
optden = 0.876286 * carb + 0.005086
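To connect the printed coefficients to y = mx + b, the sketch below pulls them out with coef() and uses them to predict an absorbance by hand. It refits the same model with the data = argument (an equivalent way of writing the formula that the slides do not use) so that predict() can accept new values. The concentration 0.5 is just an illustrative input.

```r
# Same line as before, written with data = instead of Formaldehyde$...
linear_model <- lm(optden ~ carb, data = Formaldehyde)

b <- coef(linear_model)[["(Intercept)"]]  # intercept (b)
m <- coef(linear_model)[["carb"]]         # slope (m)

# Predicted absorbance at concentration 0.5, computed by hand
by_hand <- m * 0.5 + b
by_hand

# The same prediction using predict()
from_predict <- predict(linear_model, newdata = data.frame(carb = 0.5))
from_predict
```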
Linear Regression
Your Turn: Formaldehyde
- Create a similar linear regression model except use
concentration in liters
- Output a summary of that model
Hint:
variable_name <- lm(y ~ x)
summary(variable_name)
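A sketch of one way to do this is below. The names carb_liters and linear_model_liters are our own choices.

```r
# Concentration converted from mL to L
carb_liters <- Formaldehyde$carb * 0.001

# Fit absorbance vs concentration in liters and summarize the fit
linear_model_liters <- lm(Formaldehyde$optden ~ carb_liters)
summary(linear_model_liters)
```

Because every x value shrank by a factor of 1000, the slope should come out 1000 times larger than before, while the intercept should stay the same.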
Linear Regression
Graphing a Line
plot(x = Formaldehyde$carb, y = Formaldehyde$optden, main =
"Beer's Law", xlab = "concentration", ylab = "absorbance")
abline(linear_model)
- abline() adds a line to an existing plot, so call plot() first
Your Turn: Formaldehyde
- Add your line to the plot!
Hint:
abline(linear_model)
Formaldehyde Plot with Line
abline(linear_model_liters)
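Put together, the liters version could look like the sketch below. carb_liters is our own name for the converted vector, and the plot labels are placeholders.

```r
# Concentration in liters and the model fit to it
carb_liters <- Formaldehyde$carb * 0.001
linear_model_liters <- lm(Formaldehyde$optden ~ carb_liters)

# Draw the scatterplot first, then add the fitted line on top of it
plot(x = carb_liters, y = Formaldehyde$optden,
     main = "Formaldehyde Plot with Line",
     xlab = "Concentration (L)", ylab = "absorbance")
abline(linear_model_liters)
```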
Your Turn: heart.csv NOTE: this data is not real
- Plot BMI vs cholesterol. Does this relationship look
linear?
- Does smoking status affect this relationship? In other
words, is the slope of the BMI vs cholesterol line
different for smokers and non-smokers?
Hint:
plot(), abline(), lm(), summary()
This is known as
“stratifying” for smoking
status
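The sketch below shows the general shape of a stratified fit. It does not use heart.csv: the data frame is randomly generated for illustration only, and the column names bmi, cholesterol, and smoker are assumptions, so replace them with the real column names after reading the file with read.csv().

```r
# Made-up stand-in data, NOT heart.csv (column names are assumptions)
set.seed(1)
n <- 100
heart <- data.frame(
  cholesterol = runif(n, min = 150, max = 280),
  smoker = rep(c("yes", "no"), each = n / 2)
)
heart$bmi <- 15 + 0.05 * heart$cholesterol + rnorm(n, sd = 2)

# Stratify: split the rows by smoking status
smokers <- heart[heart$smoker == "yes", ]
non_smokers <- heart[heart$smoker == "no", ]

# Fit one line per group (BMI is y, cholesterol is x)
model_smokers <- lm(bmi ~ cholesterol, data = smokers)
model_non_smokers <- lm(bmi ~ cholesterol, data = non_smokers)

# Plot everyone, then add one line per group
plot(x = heart$cholesterol, y = heart$bmi, main = "BMI vs cholesterol",
     xlab = "cholesterol", ylab = "BMI")
abline(model_smokers, lty = 1)      # solid line: smokers
abline(model_non_smokers, lty = 2)  # dashed line: non-smokers

# Compare the two slopes
coef(model_smokers)
coef(model_non_smokers)
```

Calling summary() on each model gives the slope estimates along with their standard errors, which helps judge whether a difference between the two slopes is meaningful.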
Modeling BMI and cholesterol
Other Models
- We just learned to use lm() to create linear models
- There are TONS of other models out there, such as logistic
regression, exponential regression, polynomial regression, etc.
- We won't be doing too much with these other types of
regression because everyone is coming to this with a
different level of statistics background.
Other Models
- HOWEVER, know that other models exist, and modeling with other
regression types follows the same general pattern
- Example, logistic regression:
- lm() replaced with glm()
- Add an extra argument at the end to specify distribution as binomial.
logistic_model <- glm(y_variable ~ x_variable, family = "binomial")
- Don't worry about this if you don't know what logistic regression is, but if you
are interested in other types of models, glm() pretty much has you covered
- Use RDocumentation to learn more
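As a concrete illustration of the glm() pattern, the sketch below fits a logistic regression on mtcars, another dataset built into R. The choice of dataset and variables is ours and not from the lecture: am is 0 for automatic and 1 for manual transmission, and wt is the car's weight.

```r
# Logistic regression: transmission type (0/1) as a function of weight
logistic_model <- glm(am ~ wt, data = mtcars, family = "binomial")
summary(logistic_model)

# Predicted probability of a manual transmission for a car with wt = 3
prob_manual <- predict(logistic_model, newdata = data.frame(wt = 3),
                       type = "response")
prob_manual
```

The workflow matches lm(): fit the model, store it in a variable, then call summary() on it.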
Logistics
New assignment released today, due in two weeks
Please do the mid-semester evaluation!